To recap, in Assignment 1, I (1) analysed information about the dataset I have chosen, and its associated publication, (2) mapped the expression data to HUGO gene symbols, (3) filtered out unncecessary genes, and (4) normalized the dataset. The following subsections go over these points in a bit more detail, and are taken from my Assignment 1. In addition, the code from my .rmd file includes code from A1- in order to regenerate the normalized counts matrix.
The expression dataset that I had chosen is GSE221253, taken from the GEO expression data repository. The study associated with this dataset involved adoptive cell therapy. Adoptive cell therapy via tumor infiltrating lymphocytes (ACT-TIL) is a type of immunotherapy used to treat cancer. This study took samples from 13 patients having metastatic melanoma and performed RNAseq analysis, along with other analyses like spatial proteomics and scRNAseq, on tumor tissues before and after the cell therapy (pre- and post-ACT) in order to gain a better understanding of the interactions and cell states within the tumor microenvironment throughout treatment. For the RNAseq experiment, this resulted in a total of 26 samples, and 13 in each condition (two for each patient, and the experimental conditions being pre- and post-ACT treatment).
The first step from Assignment 1 was to map the expression data to HUGO gene symbols. Only 940 out of 19117 genes from our original gene expression raw file were unable to be mapped to HUGO gene symbols. This is actually not that bad!
The next step in Assignment 1 was to filter out the genes with low counts and normalize our data. Filteration of outlier genes with low gene counts removed ~3678 genes from the expression dataset. Then, TMM normalization was performed to normalize the data. The plots (pre and post normalization + filtration) are shown below:
To recap, the first step of Assignment 2 is to perform differential gene expression, in order to identify key genes that are differentially expressed between our subgroups. This involves (1) constructing our model design, (2) estimating the dispersion, (3) performing multiple hypothesis testing and using the Quasi likelihood model to calculate differential expression, and (4) generate a heatmap and MA plot of the data. Next, we performed threshold over-representation analysis
First, this involved defining our model design, and inspecting the MDS plot (below) to see if there was any clustering between our two groups (pre-ACT and post-ACT). It doesn’t seem that there is a lot of clustering amongst the pre-ACT and post-ACT samples. However, the study associated with our data set aims to understand interactions and cell states within the tumor microenvironment throughout treatment, so it makes the most sense to define the group for our model design as pre- and post-ACT treatment.
Here, we estimate the dispersion, fit the model to our specified model design, and then calculate differential expression using the Quasi liklihood model. The table below shows the top gene hits.
|
|
|
|
In the below code blocks, we compile all of the results, and find how many genes pass the threshold value, and how many genes pass correction.
How many genes pass the p-value threshold p < 0.05?
## [1] 4684
Now, let us see how many genes pass correction for multiple hypothesis testing. ~25% of the genes pass correction. This is a decent number, as it is not too much or not too little genes.
## [1] 1643
Here is an MA plot that displays the differential gene expression
between our two groups of samples: pre- and post- ACT.
The following codeblock generates a heatmap of our data, along with
annotations, to see how the data seems to cluster. Based on what is
shown in the heatmap below, there does seem to be some minimal
clustering of genes that are overexpressed in some pre-ACT samples.
The last step of this assignment is to perform thresholded over-representation analysis (ORA). Essentially, our main goal here is to see whether our upregulated or downregulated genes all share some kind of common characteristic (maybe some of them are from the same pathway, etc). By doing this, we can then make conclusions about what kind of genes are overexpressed/underexpressed and ensure that it aligns with literature.
Here, we define our upregulated and downregulated genes. Upregulated genes are defined here to have a p-value of 0.05 and positive fold change. Downregulated genes are defined here to have a p-value of 0.05 and negative fold change.
## [1] "Number of upregulated genes"
## [1] 1170
## [1] "Number of downregulated genes"
## [1] 473
## [1] "Number of significant genes in total"
## [1] 1643
Here, we run run g:profiler on the set of all significant genes that have a p-value of < 0.05 to perform a gene set enrichment analysis.
| query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| query_1 | TRUE | 6.630947e-07 | 1494 | 1393 | 199 | 0.14285714 | 0.1331995 | GO:0044281 | GO:BP | small molecule metabolic process | 16178 | 11317 | GO:0008152 |
| query_1 | TRUE | 2.189967e-05 | 2145 | 1393 | 257 | 0.18449390 | 0.1198135 | GO:0009056 | GO:BP | catabolic process | 16178 | 3269 | GO:0008152 |
| query_1 | TRUE | 5.598736e-05 | 4941 | 1393 | 517 | 0.37114142 | 0.1046347 | GO:1901564 | GO:BP | organonitrogen compound metabolic process | 16178 | 22098 | GO:00068…. |
| query_1 | TRUE | 7.868727e-05 | 267 | 1393 | 51 | 0.03661163 | 0.1910112 | GO:0016053 | GO:BP | organic acid biosynthetic process | 16178 | 5145 | GO:00060…. |
| query_1 | TRUE | 9.388388e-05 | 764 | 1393 | 109 | 0.07824838 | 0.1426702 | GO:0006082 | GO:BP | organic acid metabolic process | 16178 | 1967 | GO:00442…. |
| query_1 | TRUE | 9.388388e-05 | 264 | 1393 | 50 | 0.03589375 | 0.1893939 | GO:0046394 | GO:BP | carboxylic acid biosynthetic process | 16178 | 12522 | GO:00160…. |
In the codeblock below, we run g:profiler on the set of upregulated genes to perform a gene set enrichment analysis.
| query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| query_1 | TRUE | 0.004049154 | 88 | 997 | 18 | 0.018054162 | 0.20454545 | GO:0042775 | GO:BP | mitochondrial ATP synthesis coupled electron transport | 16178 | 10709 | GO:0042773 |
| query_1 | TRUE | 0.004049154 | 4941 | 997 | 368 | 0.369107322 | 0.07447885 | GO:1901564 | GO:BP | organonitrogen compound metabolic process | 16178 | 22098 | GO:00068…. |
| query_1 | TRUE | 0.004049154 | 88 | 997 | 18 | 0.018054162 | 0.20454545 | GO:0042773 | GO:BP | ATP synthesis coupled electron transport | 16178 | 10707 | GO:00061…. |
| query_1 | TRUE | 0.004049154 | 8726 | 997 | 605 | 0.606820461 | 0.06933303 | GO:0044238 | GO:BP | primary metabolic process | 16178 | 11300 | GO:0008152 |
| query_1 | TRUE | 0.004049154 | 428 | 997 | 51 | 0.051153460 | 0.11915888 | GO:0007005 | GO:BP | mitochondrion organization | 16178 | 2648 | GO:0006996 |
| query_1 | TRUE | 0.004049154 | 6 | 997 | 5 | 0.005015045 | 0.83333333 | GO:0036115 | GO:BP | fatty-acyl-CoA catabolic process | 16178 | 9776 | GO:00091…. |
| query_1 | TRUE | 0.004049154 | 46 | 997 | 13 | 0.013039117 | 0.28260870 | GO:0006120 | GO:BP | mitochondrial electron transport, NADH to ubiquinone | 16178 | 1998 | GO:00196…. |
| query_1 | TRUE | 0.004049154 | 160 | 997 | 26 | 0.026078235 | 0.16250000 | GO:0009060 | GO:BP | aerobic respiration | 16178 | 3273 | GO:0045333 |
| query_1 | TRUE | 0.004244812 | 8309 | 997 | 579 | 0.580742227 | 0.06968348 | GO:0006807 | GO:BP | nitrogen compound metabolic process | 16178 | 2516 | GO:0008152 |
| query_1 | TRUE | 0.004266470 | 9757 | 997 | 666 | 0.668004012 | 0.06825869 | GO:0008152 | GO:BP | metabolic process | 16178 | 3159 | GO:0008150 |
| query_1 | TRUE | 0.004353541 | 99 | 997 | 19 | 0.019057172 | 0.19191919 | GO:0022904 | GO:BP | respiratory electron transport chain | 16178 | 6868 | GO:00229…. |
| query_1 | TRUE | 0.004460309 | 186 | 997 | 28 | 0.028084253 | 0.15053763 | GO:0045333 | GO:BP | cellular respiration | 16178 | 11755 | GO:0015980 |
| query_1 | TRUE | 0.004460309 | 490 | 997 | 55 | 0.055165496 | 0.11224490 | GO:0061919 | GO:BP | process utilizing autophagic mechanism | 16178 | 16215 | GO:0009987 |
| query_1 | TRUE | 0.004460309 | 490 | 997 | 55 | 0.055165496 | 0.11224490 | GO:0006914 | GO:BP | autophagy | 16178 | 2592 | GO:00442…. |
| query_1 | TRUE | 0.004460309 | 59 | 997 | 14 | 0.014042126 | 0.23728814 | GO:0072332 | GO:BP | intrinsic apoptotic signaling pathway by p53 class mediator | 16178 | 17869 | GO:00723…. |
| query_1 | TRUE | 0.005219212 | 61 | 997 | 14 | 0.014042126 | 0.22950820 | GO:0010257 | GO:BP | NADH dehydrogenase complex assembly | 16178 | 4072 | GO:0065003 |
| query_1 | TRUE | 0.005219212 | 61 | 997 | 14 | 0.014042126 | 0.22950820 | GO:0032981 | GO:BP | mitochondrial respiratory chain complex I assembly | 16178 | 8243 | GO:00102…. |
| query_1 | TRUE | 0.005971288 | 1168 | 997 | 107 | 0.107321966 | 0.09160959 | GO:0009057 | GO:BP | macromolecule catabolic process | 16178 | 3270 | GO:00431…. |
| query_1 | TRUE | 0.007634004 | 2145 | 997 | 176 | 0.176529589 | 0.08205128 | GO:0009056 | GO:BP | catabolic process | 16178 | 3269 | GO:0008152 |
| query_1 | TRUE | 0.007634004 | 107 | 997 | 19 | 0.019057172 | 0.17757009 | GO:0022900 | GO:BP | electron transport chain | 16178 | 6867 | GO:0006091 |
Here, we do the same for the set of downregulated genes, and perform a gene set enrichment analysis on them.
| query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| query_1 | TRUE | 1.545313e-14 | 1494 | 396 | 94 | 0.23737374 | 0.06291834 | GO:0044281 | GO:BP | small molecule metabolic process | 16178 | 11317 | GO:0008152 |
| query_1 | TRUE | 2.470354e-14 | 764 | 396 | 63 | 0.15909091 | 0.08246073 | GO:0006082 | GO:BP | organic acid metabolic process | 16178 | 1967 | GO:00442…. |
| query_1 | TRUE | 1.688262e-13 | 758 | 396 | 61 | 0.15404040 | 0.08047493 | GO:0043436 | GO:BP | oxoacid metabolic process | 16178 | 11048 | GO:0006082 |
| query_1 | TRUE | 1.688262e-13 | 739 | 396 | 60 | 0.15151515 | 0.08119080 | GO:0019752 | GO:BP | carboxylic acid metabolic process | 16178 | 6252 | GO:0043436 |
| query_1 | TRUE | 5.379899e-11 | 299 | 396 | 34 | 0.08585859 | 0.11371237 | GO:0044282 | GO:BP | small molecule catabolic process | 16178 | 11318 | GO:00090…. |
| query_1 | TRUE | 5.379899e-11 | 24 | 396 | 12 | 0.03030303 | 0.50000000 | GO:0042730 | GO:BP | fibrinolysis | 16178 | 10679 | GO:0030195 |
Here, we present the number of gene sets returned from both the downregulated and upregulated analyses:
## [1] "How many genesets were returned from downregulated analysis?"
## [1] 5057
## [1] "How many genesets were returned from upregulated analysis"
## [1] 7844
The authors in the original paper concluded that in samples after Adoptive Cell Therapy, there was an increase in the strength and number of tumor infiltrating lymphocytes (TILSs), as well as an increase in interferon-activated myeloid T-cells (Barras 2024). This suggests that there seems to be an overall improved immune response towards the tumor after cell therapy. Upon looking at the over-representation results for down-regulated and up-regulated genes, however, there does not seem to be much correlation between the papers findings and the over-representation results. The overrepresentation results list common metabolic process such as “small molecule metabolic process” and “mitochondrial ATP synthesis coupled electron transport” as top hits, but these are not indicative of anything as they are very general/broad processes.
In this section, we begin assignment 3. The first step is to perform non-threshold Gene set Enrichment Analysis. Next, we will visualize our GSEA using Cytoscape. Lastly, we will interpret and have a more granular understanding of our results.
There are a lot of existing gene set analysis algorithm that are non-thresholded, but we will be using GSEA, as it is very popular, and well trusted throughout literature (the number of citations it has tops all other methods). We will also be using gene sets from the BaderLab.
1.What method did you use? What genesets did you use? Make sure to specify versions and cite your methods. There are a lot of existing gene set analysis algorithm that are non-thresholded, but I decided to use GSEA, as it is very popular, and well trusted throughout literature (the number of citations it has tops all other methods). I also used gene sets from the BaderLab. I followed the methods outlined here
2.Summarize your enrichment results. Upon opening the index file outputted by GSEA, we can see that:
4263 / 5903 gene sets are upregulated in phenotype Pre- Adoptive Cell Therapy,
1049 gene sets are significant at FDR < 25%, 445 gene sets are significantly enriched at nominal pvalue < 1%,
944 gene sets are significantly enriched at nominal pvalue < 5%,
1640 / 5903 gene sets are upregulated in phenotype Post- Adoptive Cell Therapy,
682 gene sets are significantly enriched at FDR < 25%,
441 gene sets are significantly enriched at nominal pvalue < 1%,
605 gene sets are significantly enriched at nominal pvalue < 5%
3. How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not? When performing thresholded over-representation analysis in Assignment 2, we first had to filter out the genes based on their p-values, to ensure that we only had significant up-regulated and down-regulated genes, resulting in an overall lower number of genes than GSEA. For GSEA, we end up using all of the genes, regardless of their p-value, to ensure that we dont loose any signal. So, no it is not really a straightforward comparison due to the different methodologies used (with a threshold vs. without one).
Our next step is to use our results from non-thresholded GSEA and visualize our results in Cytoscape. In these figures here, a node represents a gene set, and an edge represents genes in common between two gene sets. Red nodes represents gene sets specifically for the Pre-ACT group, and blue nodes represents gene sets specifically for the Post-ACT group. The three figures below show the main clusters present within the network.
We can actually visualize the different clusters of this network, and identify common themes/common gene sets (pathways) and biological processes. These cluster themes for each of the above figures are presented below (in the above order):
1.Create an enrichment map - how many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout. Enrichment map is created above in the above figures. There are a total of 883 nodes and 6064 edges. To generate by map, I used a node cutoff threshold of Q-value = 0.01, and for my edge cutoff, I used a threshold value of Q-value = 0.375. I used default parameters for the rest of the fields.
2.Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well. For my annotated network (figures above), I decided to annotate each gene set with its pathway name. The rest of the parameters I used for this were default.
3.Make a publication ready figure - include this figure with proper legends in your notebook. My publication ready figure with a figure legend is below: